Add dat bypass, build-cluster-db CLI, and incremental improvements #10
Conversation
Normalizes both legacy msnet and QPX parquet formats to canonical column names at load time, extracts global_qvalue from QPX additional_scores, and adds a standalone cluster2parquet command for the cluster-parquet step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
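Such normalization can be sketched roughly like this; the legacy names, canonical names, and the `additional_scores` layout here are assumptions for illustration, not the actual msnet/QPX schemas:

```python
import pandas as pd

# Hypothetical mapping from legacy column names to canonical ones.
CANONICAL = {"PeptideSequence": "peptidoform", "PrecursorMz": "precursor_mz"}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename legacy columns and pull global_qvalue out of additional_scores."""
    df = df.rename(columns={k: v for k, v in CANONICAL.items() if k in df.columns})
    if "additional_scores" in df.columns:
        # additional_scores assumed here to be a list of
        # {"score_name": ..., "score_value": ...} dicts per row.
        df["global_qvalue"] = df["additional_scores"].map(
            lambda scores: next(
                (s["score_value"] for s in (scores or [])
                 if s["score_name"] == "global_qvalue"),
                None,
            )
        )
    return df

df = pd.DataFrame({
    "PeptideSequence": ["PEPTIDE"],
    "additional_scores": [[{"score_name": "global_qvalue", "score_value": 0.01}]],
})
out = normalize(df)
```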
BestSpectrumStrategy now uses groupby.idxmin() instead of groupby.apply(top_n_rows) for selecting the best spectrum per cluster (~27x measured speedup). Base class uses sort_values + groupby.head() for top-N selection in large clusters (~76x measured speedup). Also replaces np.isin with .isin() and np.vstack with pd.concat for cleaner DataFrame operations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
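The two selection patterns described here can be sketched as follows (column names are illustrative, not the project's actual schema):

```python
import pandas as pd

df = pd.DataFrame({
    "cluster_id": ["A", "A", "B", "B", "B"],
    "posterior_error_probability": [0.3, 0.1, 0.5, 0.2, 0.4],
    "usi": ["u1", "u2", "u3", "u4", "u5"],
})

# Slow path: a per-group apply that builds a sub-frame for every cluster.
slow = df.groupby("cluster_id", group_keys=False).apply(
    lambda g: g.nsmallest(1, "posterior_error_probability")
)

# Fast path: one vectorized idxmin per group, then a single .loc lookup.
best_idx = df.groupby("cluster_id")["posterior_error_probability"].idxmin()
fast = df.loc[best_idx]

# Top-N per group without apply: sort once, then take the head of each group.
top2 = df.sort_values("posterior_error_probability").groupby("cluster_id").head(2)
```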
Previously only the first parquet file was processed when a list was passed. Now iterates over all files, accumulating cluster results across the full dataset. State (write counts, file indices) persists across files so MGF path/order mapping remains consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements three-phase incremental clustering for adding new projects
without re-clustering the entire dataset:
1. extract-reps: Writes one representative spectrum per cluster from
cluster_metadata.parquet to MGF (TITLE=rep:{cluster_id} for tracing)
2. resolve-clusters: After MaRaCluster runs on reps + new data, maps
new cluster IDs back to existing ones via representative membership.
Handles cluster merges (multiple old clusters → one new).
3. merge-results: Appends new PSMs to membership table (dedup by USI),
rebuilds cluster metadata with updated stats and consensus spectra.
CLI: pyspectrafuse incremental extract-reps / merge-clusters
Tests cover extraction, index building, resolution, and merge scenarios.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
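Phase 2's ID resolution reduces to a small mapping exercise; a rough sketch under assumed data shapes (the helper name, tuple layout, and merge tie-break are invented for illustration):

```python
from collections import defaultdict

def resolve_clusters(rep_titles, assignments):
    """Map MaRaCluster's new cluster labels back to existing cluster ids.

    rep_titles: TITLE strings from the representative MGF, e.g. "rep:OLD_1".
    assignments: (spectrum_key, new_label) pairs; spectrum_key is assumed to be
                 the rep's MGF order (int) for representatives, or a USI-like
                 string for genuinely new spectra.
    """
    # Collect which old clusters contributed a representative to each new label.
    members = defaultdict(set)
    for key, new_label in assignments:
        if isinstance(key, int):  # a representative spectrum
            old_id = rep_titles[key].split("rep:", 1)[1]
            members[new_label].add(old_id)

    id_map = {}
    for new_label, old_ids in members.items():
        # A merge (several old clusters landing in one new cluster) resolves to
        # the lexicographically smallest old id here; the real rule may differ.
        id_map[new_label] = min(old_ids)
    return id_map

# OLD_1 and OLD_2 both end up in NEW_A, i.e. a cluster merge.
rep_titles = ["rep:OLD_1", "rep:OLD_2"]
assignments = [(0, "NEW_A"), (1, "NEW_A"), ("mzspec:PXD1:run:scan:7", "NEW_A")]
id_map = resolve_clusters(rep_titles, assignments)
```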
- Add parquet-to-dat conversion (maracluster_dat.py) bypassing MGF entirely (15x smaller)
- Add build-cluster-db command for dat-bypass workflow
- Add parquet2dat CLI command
- Expand incremental merge-clusters with BIN strategy support
- Update cluster2parquet for QPX metadata compatibility
- Register new CLI commands in pyspectrafuse.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

This PR adds a cluster database pipeline, incremental clustering tools, QPX metadata ingestion and schema normalization, new CLI commands (build-cluster-db, cluster2parquet, convert-to-qpx, convert-dat, incremental subcommands), improved consensus strategies, DAT conversion support, and related tests and packaging/docs updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant CLI as build-cluster-db CMD
    participant TSV as Cluster TSV Parser
    participant Titles as ScanTitles Parser
    participant Parquet as Parquet Scanner
    participant Schema as ParquetSchemaAdapter
    participant PSM as PSM Membership Builder
    participant Consensus as Consensus Builder
    participant Output as Parquet Writer
    User->>CLI: invoke with cluster_tsv, scan_titles, parquet_dirs
    CLI->>TSV: parse_cluster_tsv()
    TSV-->>CLI: cluster assignments (mgf_basename, scannr, cluster_id)
    CLI->>Titles: parse_scan_titles()
    Titles-->>CLI: scan → run_file, orig_scan, peptidoform, charge
    CLI->>Parquet: scan parquet_dirs for scalar fields
    Parquet->>Schema: available_columns() + adapt()
    Schema-->>Parquet: normalized rows
    Parquet-->>CLI: scalar values (PEP, q-value, mz)
    CLI->>PSM: merge cluster assignments + scan info + scalars
    PSM->>PSM: compute per-cluster stats (count, best PEP/q, purity)
    PSM->>Output: write psm_cluster_membership.parquet
    CLI->>Parquet: stream mz/intensity arrays per cluster
    Parquet->>Consensus: build consensus spectra
    Consensus-->>PSM: consensus arrays
    PSM->>Output: write cluster_metadata.parquet
    Output-->>User: return output paths
```
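The per-cluster stats step shown in the diagram can be sketched as a plain pandas aggregation (column names assumed):

```python
import pandas as pd

membership = pd.DataFrame({
    "cluster_id": ["c1", "c1", "c1", "c2"],
    "peptidoform": ["PEPTIDE", "PEPTIDE", "OTHER", "PEPTIDE"],
    "posterior_error_probability": [0.01, 0.2, 0.3, 0.05],
    "global_qvalue": [0.001, 0.01, 0.02, 0.005],
})

def purity(peptidoforms: pd.Series) -> float:
    # Fraction of PSMs agreeing with the most common peptidoform in the cluster.
    return peptidoforms.value_counts(normalize=True).iloc[0]

stats = membership.groupby("cluster_id").agg(
    psm_count=("peptidoform", "size"),
    best_pep=("posterior_error_probability", "min"),
    best_qvalue=("global_qvalue", "min"),
    purity=("peptidoform", purity),
)
```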
```mermaid
sequenceDiagram
    actor User
    participant CLI as merge-clusters CMD
    participant Resolver as resolve_incremental_clusters()
    participant RepMGF as Representative MGF parser
    participant IDMap as ID Mapper
    participant ClusterRes as ClusterResHandler
    participant Combiner as CombineCluster2Parquet
    participant Merger as append_psm_membership()
    participant Rebuilder as rebuild_cluster_metadata()
    participant Output as Parquet Writer
    User->>CLI: invoke with cluster_tsv, rep_mgf, new_parquet_dir
    CLI->>Resolver: resolve_incremental_clusters()
    Resolver->>RepMGF: build_representative_index()
    RepMGF-->>Resolver: mapping rep order → old_cluster_id
    Resolver->>IDMap: correlate new_cluster_id → old_cluster_id
    Resolver-->>CLI: id_map + new_spectra_df
    CLI->>ClusterRes: get_cluster_dict(new_cluster_tsv)
    ClusterRes-->>CLI: cluster_accession → mgf_path lookup
    CLI->>Combiner: inject_cluster_info(new_parquets, sample_info_dict)
    Combiner-->>CLI: enriched_df with cluster_accession
    CLI->>Merger: append_psm_membership(existing_membership, new_df)
    Merger-->>CLI: merged membership parquet
    CLI->>Rebuilder: rebuild_cluster_metadata(merged_membership,...)
    Rebuilder-->>Output: updated cluster_metadata.parquet
    CLI-->>User: summary + output paths
```
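The dedup-by-USI append in the merge flow is essentially a concat plus drop_duplicates; a sketch with illustrative USIs:

```python
import pandas as pd

existing = pd.DataFrame({
    "usi": ["mzspec:PXD1:r1:scan:1", "mzspec:PXD1:r1:scan:2"],
    "cluster_id": ["OLD_1", "OLD_2"],
})
incoming = pd.DataFrame({
    "usi": ["mzspec:PXD1:r1:scan:2", "mzspec:PXD2:r1:scan:9"],
    "cluster_id": ["OLD_2", "OLD_1"],
})

# Keep the first (existing) row on USI collisions so re-runs are idempotent.
merged = (
    pd.concat([existing, incoming], ignore_index=True)
    .drop_duplicates(subset="usi", keep="first")
    .reset_index(drop=True)
)
```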
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
- Lead with convert-dat and build-cluster-db as recommended commands
- Add incremental clustering and QPX metadata to feature list
- Update project structure with new modules
- Mark MGF conversion and SDRF handling as legacy
- Update version to 0.0.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 7
🧹 Nitpick comments (2)
pyspectrafuse/commands/parquet2dat.py (1)

35-39: Move parquet discovery out of the spectrum2msp command module.

This command now depends on another CLI command for a generic filesystem helper. That makes command imports bidirectional over time and duplicates logic that already belongs in pyspectrafuse.common.parquet_utils. Pull the helper into common and reuse it from both commands.

pyspectrafuse/commands/spectrum2msp.py (1)

40-53: Return parquet files in a stable order.

glob.glob() does not guarantee a reproducible traversal order. Sorting here keeps downstream behavior deterministic: convert-dat increments file_idx per discovered file, and this command also uses the first entry for output naming.

Proposed fix

```diff
- for path_str in glob.glob(str(Path(directory) / f'**/*{extensions}'), recursive=True):
+ for path_str in sorted(glob.glob(str(Path(directory) / f'**/*{extensions}'), recursive=True)):
```
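The determinism point is easy to verify in isolation (temporary files, illustrative names):

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    for name in ("b.parquet", "a.parquet", "sub/c.parquet"):
        p = Path(d) / name
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()

    # sorted() pins the traversal order regardless of how the
    # filesystem happens to enumerate directory entries.
    found = sorted(glob.glob(str(Path(d) / "**/*.parquet"), recursive=True))
    names = [Path(p).name for p in found]
```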
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyspectrafuse/commands/build_cluster_db.py`:
- Around line 272-275: The groupby().idxmin() call can fail when a cluster's
posterior_error_probability is all NaN; before computing best_idx, coerce
posterior_error_probability to numeric on the enriched DataFrame and replace
NaNs with np.inf (e.g., set enriched['_pep'] =
pd.to_numeric(enriched['posterior_error_probability'],
errors='coerce').fillna(np.inf) or call enriched['_pep'].fillna(np.inf,
inplace=True)) so that enriched.groupby('cluster_id')['_pep'].idxmin() returns
valid positions; then compute best_idx and select best_rows as before (symbols:
enriched, '_pep', posterior_error_probability, idxmin, best_idx, best_rows).
In `@pyspectrafuse/commands/quantmsio2mgf.py`:
- Around line 46-53: The worker pool is being created with
Pool(processes=os.cpu_count()) which ignores the --task_parallel option; change
the Pool creation to use the task_parallel value (i.e.
Pool(processes=task_parallel)) so that Parquet2Mgf.convert_to_mgf_task actually
runs with the requested number of worker processes, and keep the existing
chunksize parameter for batching (or compute a separate chunksize variable) when
calling pool.imap to control task batching rather than concurrency.
In `@pyspectrafuse/commands/spectrum2msp.py`:
- Line 198: The code currently calls ParquetPathHandler(...).get_item_info()
with the first parquet file path (path_parquet_lst[0]), which extracts the
accession from the last path segment and therefore yields the parquet filename
instead of the project basename; change the call to pass the project directory
(e.g., the parent directory of the parquet files or the existing project_dir
variable) into ParquetPathHandler so get_item_info() reads the directory name as
the accession; update the reference at the call site (where basename is set) to
construct the handler with the folder containing the parquet files rather than
the file path itself.
In `@pyspectrafuse/common/parquet_utils.py`:
- Around line 58-68: The loop building parquet_path_lst currently appends any
glob hit (parquet_file) without checking whether it's a file, so partitioned
datasets stored as directories ending in .parquet get returned; update the loop
in this helper to skip directories by calling parquet_file.is_file() (or
equivalently ensure pathlib.Path(parquet_file).is_file()) before applying the
existing suffix/name/parts filters and appending to parquet_path_lst, leaving
all other filtering (checks against _metadata_suffixes, _exclude_names,
_exclude_dirs) unchanged.
In `@pyspectrafuse/incremental/resolve_clusters.py`:
- Around line 99-103: The current filtering uses rep_mgf_basename and
str.contains, which yields false positives; instead compute exact basenames
using pathlib.Path and do an equality match: derive rep_mgf_basename from
Path(rep_mgf_path).name, compare against cluster_df['mgf_path'].map(lambda p:
Path(p).name) (or add a temporary column of basenames) to build rep_mask, then
set rep_rows/new_rows accordingly; ensure pathlib.Path is imported if not
already.
In `@pyspectrafuse/mgf_convert/parquet2mgf.py`:
- Around line 155-163: _get_mgf_path currently always pulls instrument from
sample_info_dict, ignoring the skip_instrument option; update _get_mgf_path to
check the skip_instrument flag (the variable/attribute used by Parquet2Mgf for
that option) and, when True, use a single shared instrument bucket (e.g., a
fixed name like "all_instruments" or an empty instrument segment) instead of
info[0]; modify the function where mgf_group_df['mgf_file_path'] is set so
_get_mgf_path reads skip_instrument and constructs the path accordingly
(reference function name _get_mgf_path, the sample_info_dict lookup, and
Parquet2Mgf.skip_instrument or the local skip_instrument variable).
In `@tests/test_incremental.py`:
- Around line 150-155: The fixture string assignments for the variable
tsv_content use unnecessary f-string prefixes (e.g., f"reps.mgf\t0\tNEW_A\n"
blocks) which trigger Ruff F541; remove the leading f from each quoted literal
(including the second occurrence around lines 193-196) so they become plain
string literals while preserving the content and escape sequences.
---
Nitpick comments:
In `@pyspectrafuse/commands/parquet2dat.py`:
- Around line 35-39: The parquet discovery helper find_target_ext_files is being
imported from pyspectrafuse.commands.spectrum2msp causing a cross-command
dependency; move or reuse the generic helper from
pyspectrafuse.common.parquet_utils instead: remove the import of
find_target_ext_files from pyspectrafuse.commands.spectrum2msp in parquet2dat.py
and import the equivalent function (or add a new helper) from
pyspectrafuse.common.parquet_utils, then call that helper to produce
parquet_files and delete any duplicated discovery logic so both spectrum2msp and
parquet2dat rely on the single common helper (update exports in parquet_utils if
needed).
In `@pyspectrafuse/commands/spectrum2msp.py`:
- Around line 40-53: glob.glob() can return files in nondeterministic order;
make the discovery deterministic by sorting the found paths. Change the loop
over glob.glob(...) to iterate over sorted(glob.glob(...)) or sort the collected
list `res` (e.g., call `res.sort()` before returning) so that the order of
`path_str` items and the resulting `res` list is stable for downstream
`file_idx` increments and output naming; update code around the for-loop that
builds `res` (referencing `path_str`, `file`, and `res`) to ensure the sorted
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2b66a7c3-de12-4302-8ff6-4d095e94000f
📒 Files selected for processing (27)
- pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
- pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
- pyspectrafuse/commands/build_cluster_db.py
- pyspectrafuse/commands/cluster2parquet.py
- pyspectrafuse/commands/convert_msnet_to_qpx.py
- pyspectrafuse/commands/incremental.py
- pyspectrafuse/commands/parquet2dat.py
- pyspectrafuse/commands/quantmsio2mgf.py
- pyspectrafuse/commands/spectrum2msp.py
- pyspectrafuse/common/constant.py
- pyspectrafuse/common/msp_utils.py
- pyspectrafuse/common/parquet_utils.py
- pyspectrafuse/common/qpx_metadata.py
- pyspectrafuse/common/sdrf_utils.py
- pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
- pyspectrafuse/consensus_strategy/best_spetrum_strategy.py
- pyspectrafuse/consensus_strategy/binning_strategy.py
- pyspectrafuse/consensus_strategy/consensus_strategy_base.py
- pyspectrafuse/consensus_strategy/most_similar_strategy.py
- pyspectrafuse/incremental/__init__.py
- pyspectrafuse/incremental/merge_results.py
- pyspectrafuse/incremental/representative_mgf.py
- pyspectrafuse/incremental/resolve_clusters.py
- pyspectrafuse/maracluster_dat.py
- pyspectrafuse/mgf_convert/parquet2mgf.py
- pyspectrafuse/pyspectrafuse.py
- tests/test_incremental.py
👮 Files not reviewed due to content moderation or server errors (10)
- pyspectrafuse/common/msp_utils.py
- pyspectrafuse/common/constant.py
- pyspectrafuse/common/qpx_metadata.py
- pyspectrafuse/consensus_strategy/binning_strategy.py
- pyspectrafuse/pyspectrafuse.py
- pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
- pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
- pyspectrafuse/common/sdrf_utils.py
- pyspectrafuse/commands/convert_msnet_to_qpx.py
- pyspectrafuse/commands/incremental.py
```python
enriched['_pep'] = pd.to_numeric(enriched['posterior_error_probability'], errors='coerce')
best_idx = enriched.groupby('cluster_id')['_pep'].idxmin()
best_rows = enriched.loc[best_idx, ['cluster_id', 'run_file_name', 'orig_scan',
                                    'peptidoform', 'precursor_mz']].copy()
```
Guard idxmin() against all-NaN PEP groups.
posterior_error_probability is populated through a left join, so a cluster can end up with _pep entirely missing. In that case groupby().idxmin() produces invalid indexers and enriched.loc[best_idx, ...] will fail. Filling _pep with np.inf before the groupby is enough here.
Proposed fix

```diff
- enriched['_pep'] = pd.to_numeric(enriched['posterior_error_probability'], errors='coerce')
+ enriched['_pep'] = pd.to_numeric(
+     enriched['posterior_error_probability'], errors='coerce'
+ ).fillna(np.inf)
  best_idx = enriched.groupby('cluster_id')['_pep'].idxmin()
```
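The failure mode and the fillna(np.inf) guard are easy to reproduce in isolation:

```python
import numpy as np
import pandas as pd

enriched = pd.DataFrame({
    "cluster_id": ["c1", "c1", "c2"],
    # c2 ends up all-NaN, as it can after a left join.
    "posterior_error_probability": [0.2, 0.1, None],
})

# Guard: coerce to numeric and replace NaN with +inf before taking idxmin,
# so every group yields a valid row index for the .loc lookup.
enriched["_pep"] = pd.to_numeric(
    enriched["posterior_error_probability"], errors="coerce"
).fillna(np.inf)
best_idx = enriched.groupby("cluster_id")["_pep"].idxmin()
best_rows = enriched.loc[best_idx]
```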
```diff
  pool = Pool(processes=os.cpu_count())

  # Pack all required parameters into a list of tuples
- tasks = [(parquet_file_path, sdrf_file_path, res_file_path, batch_size, spectra_capacity) for parquet_file_path
-          in
-          parquet_file_path_lst]
+ tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
+          for parquet_file_path in parquet_file_path_lst]

  # Use pool.imap to execute tasks in parallel with specified maximum parallel tasks
  for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=task_parallel):
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, locate the file
find . -type f -name "quantmsio2mgf.py" 2>/dev/null
```

Repository: bigbio/pyspectrafuse
Length of output: 105

🏁 Script executed:

```shell
# View the file context around lines 46-53
cat -n pyspectrafuse/commands/quantmsio2mgf.py | sed -n '30,70p'
```

Repository: bigbio/pyspectrafuse
Length of output: 1624

🏁 Script executed:

```shell
# Search for task_parallel definition and usage in the file
rg "task_parallel" pyspectrafuse/commands/quantmsio2mgf.py -B 2 -A 2
```

Repository: bigbio/pyspectrafuse
Length of output: 1256

🏁 Script executed:

```shell
# Check the function signature to understand parameters
ast-grep --pattern 'def $_($$) {
  $$$
}'
```

Repository: bigbio/pyspectrafuse
Length of output: 46
task_parallel is not limiting worker concurrency.
Line 46 creates os.cpu_count() worker processes, while line 53 passes task_parallel as chunksize, which only controls task batching, not the number of workers. So --task_parallel 1 still spawns the full CPU count in parallel. Wire the option into Pool(processes=...) to actually limit worker count, and use chunksize separately for task batching.
Proposed fix

```diff
- pool = Pool(processes=os.cpu_count())
-
- # Pack all required parameters into a list of tuples
- tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
-          for parquet_file_path in parquet_file_path_lst]
-
- # Use pool.imap to execute tasks in parallel with specified maximum parallel tasks
- for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=task_parallel):
-     pass
-
- # Close the process pool and wait for all processes to complete
- pool.close()
- pool.join()
+ worker_count = min(task_parallel, os.cpu_count() or 1)
+
+ # Pack all required parameters into a list of tuples
+ tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
+          for parquet_file_path in parquet_file_path_lst]
+
+ with Pool(processes=worker_count) as pool:
+     for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=1):
+         pass
```
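The distinction between processes and chunksize is easy to demonstrate with the API itself; this sketch uses the thread-backed multiprocessing.dummy.Pool, which shares the Pool interface, so it runs anywhere without forking:

```python
from multiprocessing.dummy import Pool  # thread-based, same API as multiprocessing.Pool

def work(x):
    return x * x

tasks = list(range(8))

# processes bounds how many workers run concurrently;
# chunksize only controls how many tasks are handed to a worker at once.
with Pool(processes=2) as pool:
    results = list(pool.imap(work, tasks, chunksize=4))
```

imap preserves input order, so results line up with tasks regardless of which worker ran each chunk.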
```diff
  output_dir = Path(parquet_dir) / 'msp' / species / instrument / charge
  output_dir.mkdir(parents=True, exist_ok=True)
- basename = ParquetPathHandler(parquet_dir).get_item_info()
+ basename = ParquetPathHandler(path_parquet_lst[0]).get_item_info()
```
Use the project directory for get_item_info(), not the first parquet filename.
Line 198 passes a parquet file path into get_item_info(), but that helper extracts the accession from the last path segment. For typical inputs the last segment is just something.parquet, so this now raises ValueError or derives the wrong basename.
Proposed fix

```diff
- basename = ParquetPathHandler(path_parquet_lst[0]).get_item_info()
+ basename = ParquetPathHandler(parquet_dir).get_item_info()
```
```python
# Use glob.glob (follows symlinks) instead of Path.rglob (does not)
for path_str in glob.glob(str(Path(dir_path) / '**/*.parquet'), recursive=True):
    parquet_file = Path(path_str)
    name = str(parquet_file)
    if any(name.endswith(s) for s in _metadata_suffixes):
        continue
    if any(ex in parquet_file.stem for ex in _exclude_names):
        continue
    if any(part in _exclude_dirs for part in parquet_file.parts):
        continue
    parquet_path_lst.append(parquet_file)
```
Filter out .parquet directories before returning results.
Line 60 turns every glob hit into a candidate path, but this helper never checks is_file(). Partitioned parquet datasets stored as directories named *.parquet will now be returned here and later consumers will treat them as files.
Proposed fix

```diff
  # Use glob.glob (follows symlinks) instead of Path.rglob (does not)
  for path_str in glob.glob(str(Path(dir_path) / '**/*.parquet'), recursive=True):
      parquet_file = Path(path_str)
+     if not parquet_file.is_file():
+         continue
      name = str(parquet_file)
      if any(name.endswith(s) for s in _metadata_suffixes):
          continue
```
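A small self-contained demo of why the is_file() check matters (temporary paths, illustrative names):

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "plain.parquet").touch()            # a real parquet file
    (Path(d) / "partitioned.parquet").mkdir()      # a partitioned dataset *directory*
    (Path(d) / "partitioned.parquet" / "part-0.parquet").touch()

    # glob matches the directory too, because only the name is tested.
    hits = [Path(p) for p in glob.glob(str(Path(d) / "**/*.parquet"), recursive=True)]
    files_only = sorted(p.name for p in hits if p.is_file())
```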
```python
rep_mgf_basename = rep_mgf_path.rsplit('/', 1)[-1] if '/' in rep_mgf_path else rep_mgf_path

rep_mask = cluster_df['mgf_path'].str.contains(rep_mgf_basename, regex=False)
rep_rows = cluster_df[rep_mask].copy()
new_rows = cluster_df[~rep_mask].copy()
```
Use exact basename matching for representative rows.
str.contains(rep_mgf_basename, regex=False) will also match unrelated paths like foo_reps.mgf.bak, so those spectra get filtered out of new_rows and can inherit the wrong resolved_cluster_id. Using Path(...).name == rep_mgf_basename avoids both the false positives and the current /-only path split.
Proposed fix

```diff
+from pathlib import Path
+
- rep_mgf_basename = rep_mgf_path.rsplit('/', 1)[-1] if '/' in rep_mgf_path else rep_mgf_path
-
- rep_mask = cluster_df['mgf_path'].str.contains(rep_mgf_basename, regex=False)
+ rep_mgf_basename = Path(rep_mgf_path).name
+ rep_mask = cluster_df['mgf_path'].map(lambda p: Path(str(p)).name == rep_mgf_basename)
```
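The false-positive risk is concrete; a minimal demo of substring vs exact basename matching:

```python
from pathlib import Path

import pandas as pd

cluster_df = pd.DataFrame({
    "mgf_path": ["/data/reps.mgf", "/data/foo_reps.mgf.bak", "/data/new_data.mgf"],
})
rep_mgf_basename = Path("/work/reps.mgf").name

# Substring match: also catches the unrelated .bak file.
contains_mask = cluster_df["mgf_path"].str.contains(rep_mgf_basename, regex=False)
# Exact basename equality: only the true representative file matches.
exact_mask = cluster_df["mgf_path"].map(lambda p: Path(p).name == rep_mgf_basename)
```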
```diff
  def _get_mgf_path(row):
      fname = Parquet2Mgf.get_filename_from_usi(row)
      info = sample_info_dict.get(fname)
      if info is None:
          logger.warning(f"No metadata for run file: {fname}")
          info = ['Unknown', 'Unknown']
      return '/'.join(info + ['charge' + str(row["charge"]), 'mgf files'])

- mgf_group_df['mgf_file_path'] = row_group.apply(
-     lambda row: '/'.join(sample_info_dict.get(Parquet2Mgf.get_filename_from_usi(row)) +
-                          ['charge' + str(row["charge"]), 'mgf files']), axis=1)
+ mgf_group_df['mgf_file_path'] = row_group.apply(_get_mgf_path, axis=1)
```
skip_instrument is currently a no-op.
Lines 94-96 advertise a mode that should collapse output under a shared instrument bucket, but _get_mgf_path() always uses the instrument from sample_info_dict. As written, --skip_instrument still writes into per-instrument folders.
Proposed fix

```diff
  def _get_mgf_path(row):
      fname = Parquet2Mgf.get_filename_from_usi(row)
      info = sample_info_dict.get(fname)
      if info is None:
          logger.warning(f"No metadata for run file: {fname}")
          info = ['Unknown', 'Unknown']
+     if skip_instrument:
+         info = [info[0], 'all_instruments']
      return '/'.join(info + ['charge' + str(row["charge"]), 'mgf files'])
```

🧰 Tools
return '/'.join(info + ['charge' + str(row["charge"]), 'mgf files'])🧰 Tools
🪛 Ruff (0.15.9)
[warning] 161-161: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
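The intended behavior can be sketched as a pure path-building function; the all_instruments bucket name comes from the suggestion and is not a confirmed project convention:

```python
def build_mgf_path(species: str, instrument: str, charge: int, skip_instrument: bool) -> str:
    # With skip_instrument, collapse every run into one shared instrument bucket.
    bucket = "all_instruments" if skip_instrument else instrument
    return "/".join([species, bucket, f"charge{charge}", "mgf files"])

per_instrument = build_mgf_path("human", "QExactive", 2, skip_instrument=False)
shared = build_mgf_path("human", "QExactive", 2, skip_instrument=True)
```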
```python
tsv_content = (
    f"reps.mgf\t0\tNEW_A\n"
    f"reps.mgf\t1\tNEW_B\n"
    f"new_data.mgf\t0\tNEW_A\n"
    f"new_data.mgf\t1\tNEW_C\n"
)
```
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
```shell
# First, locate and read the test_incremental.py file
find . -name "test_incremental.py" -type f
```
Repository: bigbio/pyspectrafuse
Length of output: 91
🏁 Script executed:
```shell
# Read the specific lines mentioned in the review
wc -l tests/test_incremental.py
```
Repository: bigbio/pyspectrafuse
Length of output: 93
🏁 Script executed:
```shell
# Read lines 150-155 and 193-196 from the test file
sed -n '150,155p; 193,196p' tests/test_incremental.py
```
Repository: bigbio/pyspectrafuse
Length of output: 357
🏁 Script executed:
```shell
# Double-check by searching for any interpolation patterns in these strings
sed -n '150,155p; 193,196p' tests/test_incremental.py | grep -E '\{.*\}' || echo "No placeholders found"
```
Repository: bigbio/pyspectrafuse
Length of output: 85
Remove the unused f prefixes in these fixture strings.
These f-string literals contain no placeholders, so Ruff flags them with F541 and the test module will fail lint in its current form.
Proposed fix
- f"reps.mgf\t0\tNEW_A\n"
- f"reps.mgf\t1\tNEW_B\n"
- f"new_data.mgf\t0\tNEW_A\n"
- f"new_data.mgf\t1\tNEW_C\n"
+ "reps.mgf\t0\tNEW_A\n"
+ "reps.mgf\t1\tNEW_B\n"
+ "new_data.mgf\t0\tNEW_A\n"
+ "new_data.mgf\t1\tNEW_C\n"- f"reps.mgf\t0\tMERGED\n"
- f"reps.mgf\t1\tMERGED\n"
+ "reps.mgf\t0\tMERGED\n"
+ "reps.mgf\t1\tMERGED\n"Also applies to: 193-196
🧰 Tools
🪛 Ruff (0.15.9)
[error] 151-151: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 152-152: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 153-153: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 154-154: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_incremental.py` around lines 150 - 155, The fixture string
assignments for the variable tsv_content use unnecessary f-string prefixes
(e.g., f"reps.mgf\t0\tNEW_A\n" blocks) which trigger Ruff F541; remove the
leading f from each quoted literal (including the second occurrence around lines
193-196) so they become plain string literals while preserving the content and
escape sequences.
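For context, the fixture above mimics a MaRaCluster cluster TSV (file, scan index, cluster label). A minimal, hypothetical sketch of the resolve-clusters idea — not the project's actual implementation — maps new cluster IDs back to old ones through the representative spectra:

```python
def resolve_clusters(tsv_rows, rep_cluster_ids, reps_file='reps.mgf'):
    """Map new cluster IDs to the old clusters whose representatives they contain.

    tsv_rows:        iterable of 'file\\tscan\\tnew_cluster' rows
    rep_cluster_ids: old cluster ID per representative, indexed by scan number
    """
    mapping = {}
    for row in tsv_rows:
        fname, scan, new_cluster = row.rstrip('\n').split('\t')
        if fname == reps_file:
            # a new cluster holding several representatives is a merge
            mapping.setdefault(new_cluster, []).append(rep_cluster_ids[int(scan)])
    return mapping
```

With the fixture above, NEW_A resolves to the first old cluster, NEW_B to the second, and NEW_C (containing no representative) stays a genuinely new cluster; the MERGED fixture would yield one new cluster mapped to both old IDs.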
Parquet-to-dat is the standard conversion path, not a bypass of anything. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix test_get_metadata_dict_from_sdrf: column name case mismatch (Characteristics[organism] -> characteristics[organism])
- Switch from setuptools to hatchling build backend
- Bump version to 0.0.4, require Python >=3.10
- Update Dockerfile to use uv for faster installs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace per-row iterrows() with columnar array access
- Buffer dat/scaninfo/title writes per batch instead of per spectrum
- Pre-extract columns (charges, mz, runs, etc.) outside inner loop
- Same output, 3-5x faster on large datasets

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Dockerfile`:
- Around line 39-44: The Docker build fails because pyproject.toml declares
readme = "README.md" but README.md is not copied before running RUN uv pip
install --system --no-cache . Update the Dockerfile so that README.md is copied
into the image before the install step (i.e., place a COPY README.md . between
the COPY pyproject.toml . and COPY pyspectrafuse/ ./ lines) so the package
installation can read the declared README during build.
- Line 34: Replace the non-reproducible image reference used as a build stage
source in the COPY statement (currently "ghcr.io/astral-sh/uv:latest" in the
COPY --from=... line) with an explicit version tag and its SHA256 digest (e.g.,
ghcr.io/astral-sh/uv:<version>@sha256:<digest>); obtain the correct tag and
digest using docker manifest inspect (or the uv Docker integration docs) and
update the COPY --from=... source to that pinned reference to ensure
reproducible, verifiable builds.
In `@pyspectrafuse/maracluster_dat.py`:
- Around line 147-149: The code defines a columns list including 'rt' but still
writes Spectrum records with retentionTime=0.0; update the Spectrum/dict
construction to use the parsed rt value instead of the hard-coded 0.0 (e.g.,
retentionTime=rt or retentionTime=row['rt']), and make the same change at the
other spot flagged (the second Spectrum/dict creation). Ensure the 'rt' value is
cast to float if necessary before assigning to retentionTime so the parquet
persists the actual retention time.
- Around line 196-218: The code converts mz_array and intensity_array to numpy
arrays then calls _bin_spectrum but doesn't verify they have the same length;
when they differ _bin_spectrum may index out-of-bounds. Before computing
charge/prec_mz/etc., add a validation step comparing lengths of mz_arr and
int_arr (and that both lengths > 0) and skip the spectrum (increment skipped) if
they differ or are zero-length; reference the variables mz_arr, int_arr and the
called function _bin_spectrum when adding this guard so only well-formed arrays
reach _bin_spectrum.
- Around line 264-279: The converter is currently grabbing any ".parquet" files
(via find_target_ext_files) and produces .dat outputs for non-spectrum files;
restrict it to spectrum inputs by changing the finder/filter to only match
".psm.parquet" (e.g., call find_target_ext_files(parquet_dir, '.psm.parquet') or
post-filter parquet_files to keep only names ending with '.psm.parquet'), and
keep the FileNotFoundError when the filtered list is empty; reference the
existing parquet_files variable and the find_target_ext_files call so the filter
change is localized and prevents producing .dat for cluster metadata or
membership files.
In `@README.md`:
- Line 162: The fenced code block at the README's project tree is missing a
language identifier (triggering markdownlint MD040); update the opening fence
from ``` to ```text (or another appropriate language like ```bash) so the block
becomes labeled (e.g., change the block around the pyspectrafuse tree to start
with ```text) and leave the closing ``` as-is to satisfy the linter and improve
rendering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9eb047c6-ab59-4739-9146-a66082074b31
📒 Files selected for processing (5)
- Dockerfile
- README.md
- pyproject.toml
- pyspectrafuse/maracluster_dat.py
- tests/conftest.py
```diff
     && rm -rf /var/lib/apt/lists/*

 ## Install uv
 COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
```
🧩 Analysis chain
🌐 Web query:
What is the recommended secure Docker pattern for ghcr.io/astral-sh/uv (pinning by explicit version and sha256 digest)?
💡 Result:
The recommended secure Docker pattern for ghcr.io/astral-sh/uv is to pin to an explicit version tag and further pin by SHA256 digest for reproducibility and security against tag movement. Official uv documentation provides this example in a multi-stage Dockerfile: FROM python:3.12-slim-trixie COPY --from=ghcr.io/astral-sh/uv:0.11.6 /uv /uvx /bin/ For maximum security, replace the tag with @sha256:: COPY --from=ghcr.io/astral-sh/uv@sha256:2381d6aa60c326b71fd40023f921a0a3b8f91b14d5db6b90402e65a635053709 /uv /uvx /bin/ To obtain the current digest for a specific tag (e.g., :0.4.17), run: docker manifest inspect ghcr.io/astral-sh/uv:0.4.17 (look for "digest" under the platform config). Available tags include :latest, :{version} (e.g., 0.11.6), and variants like :debian, :python3.12-trixie. View all at https://ghcr.io/astral-sh/uv. Images are signed with GitHub attestations; verify with gh attestation verify oci://ghcr.io/astral-sh/uv:0.11.6. This ensures supply chain security by preventing image substitution.
Citations:
- 1: https://docs.astral.sh/uv/guides/integration/docker/
- 2: Docker documentation should cover pinning with sha256 astral-sh/uv#6565
- 3: [docs/integration/docker] add sha pinning tip astral-sh/uv#10955
- 4: https://ghcr.io/astral-sh/uv
- 5: Publish Docker image: ghcr.io/astral-sh/uv astral-sh/uv#2774
Pin the uv image by explicit version tag and SHA256 digest.
The ghcr.io/astral-sh/uv:latest tag is non-reproducible and weakens supply-chain control. Use an explicit version tag combined with SHA256 digest pinning to ensure build reproducibility and prevent image substitution attacks.
Proposed change
```diff
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+COPY --from=ghcr.io/astral-sh/uv:<pinned-version>@sha256:<pinned-digest> /uv /uvx /bin/
```
Consult the official uv Docker integration guide for available version tags and instructions to retrieve the corresponding digest using docker manifest inspect.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Dockerfile` at line 34, Replace the non-reproducible image reference used as
a build stage source in the COPY statement (currently
"ghcr.io/astral-sh/uv:latest" in the COPY --from=... line) with an explicit
version tag and its SHA256 digest (e.g.,
ghcr.io/astral-sh/uv:<version>@sha256:<digest>); obtain the correct tag and
digest using docker manifest inspect (or the uv Docker integration docs) and
update the COPY --from=... source to that pinned reference to ensure
reproducible, verifiable builds.
```diff
 ## Copy project files
 COPY pyproject.toml .
 COPY pyspectrafuse/ ./pyspectrafuse/

-## Install the package
-RUN pip install --no-cache-dir -e .
+## Install the package with uv
+RUN uv pip install --system --no-cache .
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Checking declared readme in pyproject.toml:"
rg -n '^\s*readme\s*=' pyproject.toml
echo
echo "Checking copied files in Dockerfile:"
rg -n '^\s*COPY\s+' Dockerfile
```
Repository: bigbio/pyspectrafuse
Length of output: 291
Add missing README.md to Docker build context before install.
When uv pip install . runs from the local directory, it reads pyproject.toml which declares readme = "README.md" at line 9. The build will fail because README.md is not copied into the Docker image before the install command executes. Add the missing file between the pyproject.toml and pyspectrafuse/ copies:
Fix
```diff
 ## Copy project files
 COPY pyproject.toml .
+COPY README.md .
 COPY pyspectrafuse/ ./pyspectrafuse/

 ## Install the package with uv
 RUN uv pip install --system --no-cache .
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```dockerfile
## Copy project files
COPY pyproject.toml .
COPY README.md .
COPY pyspectrafuse/ ./pyspectrafuse/

## Install the package with uv
RUN uv pip install --system --no-cache .
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Dockerfile` around lines 39 - 44, The Docker build fails because
pyproject.toml declares readme = "README.md" but README.md is not copied before
running RUN uv pip install --system --no-cache . Update the Dockerfile so that
README.md is copied into the image before the install step (i.e., place a COPY
README.md . between the COPY pyproject.toml . and COPY pyspectrafuse/ ./ lines)
so the package installation can read the declared README during build.
```python
columns = ['mz_array', 'intensity_array', 'charge',
           'observed_mz', 'calculated_mz', 'rt', 'scan',
           'run_file_name', 'peptidoform']
```
Persist the parquet RT instead of hard-coding 0.0.
rt is loaded here, but every Spectrum record is still written with retentionTime=0. Any MaRaCluster mode that consumes RT from the .dat payload will see all spectra at time zero.
Proposed fix
```diff
     runs = df['run_file_name'].values if 'run_file_name' in df.columns else [''] * len(df)
     peps = df['peptidoform'].values if 'peptidoform' in df.columns else [''] * len(df)
     scans = df['scan'].values if 'scan' in df.columns else [0] * len(df)
+    rts = df['rt'].values if 'rt' in df.columns else np.zeros(len(df), dtype=np.float32)
     ...
+    rt = rts[i]
+    rt = 0.0 if rt is None or np.isnan(rt) else float(rt)
     dat_buf.extend(_pack_spectrum(
-        file_idx, scannr, charge, prec_mz, 0.0, peak_bins))
+        file_idx, scannr, charge, prec_mz, rt, peak_bins))
```
Also applies to: 226-227
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyspectrafuse/maracluster_dat.py` around lines 147 - 149, The code defines a
columns list including 'rt' but still writes Spectrum records with
retentionTime=0.0; update the Spectrum/dict construction to use the parsed rt
value instead of the hard-coded 0.0 (e.g., retentionTime=rt or
retentionTime=row['rt']), and make the same change at the other spot flagged
(the second Spectrum/dict creation). Ensure the 'rt' value is cast to float if
necessary before assigning to retentionTime so the parquet persists the actual
retention time.
```python
mz_arr = np.asarray(mz_arr, dtype=np.float64)
int_arr = np.asarray(int_arr, dtype=np.float64)

if len(mz_arr) == 0 or len(int_arr) == 0:
    skipped += 1
    continue

charge = int(charges[i])
if charge <= 0:
    skipped += 1
    continue

# Precursor m/z: prefer observed_mz, fall back to calculated_mz
prec_mz = obs_mz[i] if obs_mz is not None else None
if prec_mz is None or np.isnan(prec_mz):
    prec_mz = calc_mz[i] if calc_mz is not None else 0.0
prec_mz = float(prec_mz)

# Neutral mass for peak filtering
prec_mass = prec_mz * charge - PROTON_MASS * (charge - 1)

# Bin the spectrum
peak_bins = _bin_spectrum(mz_arr, int_arr, prec_mass)
```
Validate fragment array lengths before binning.
If a parquet row has different-length mz_array and intensity_array, _bin_spectrum() can index past the end of mz_array and kill the whole conversion instead of skipping one bad spectrum.
Proposed fix
```diff
-        if len(mz_arr) == 0 or len(int_arr) == 0:
+        if len(mz_arr) == 0 or len(int_arr) == 0 or len(mz_arr) != len(int_arr):
             skipped += 1
             continue
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
mz_arr = np.asarray(mz_arr, dtype=np.float64)
int_arr = np.asarray(int_arr, dtype=np.float64)
if len(mz_arr) == 0 or len(int_arr) == 0 or len(mz_arr) != len(int_arr):
    skipped += 1
    continue
charge = int(charges[i])
if charge <= 0:
    skipped += 1
    continue
# Precursor m/z: prefer observed_mz, fall back to calculated_mz
prec_mz = obs_mz[i] if obs_mz is not None else None
if prec_mz is None or np.isnan(prec_mz):
    prec_mz = calc_mz[i] if calc_mz is not None else 0.0
prec_mz = float(prec_mz)
# Neutral mass for peak filtering
prec_mass = prec_mz * charge - PROTON_MASS * (charge - 1)
# Bin the spectrum
peak_bins = _bin_spectrum(mz_arr, int_arr, prec_mass)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyspectrafuse/maracluster_dat.py` around lines 196 - 218, The code converts
mz_array and intensity_array to numpy arrays then calls _bin_spectrum but
doesn't verify they have the same length; when they differ _bin_spectrum may
index out-of-bounds. Before computing charge/prec_mz/etc., add a validation step
comparing lengths of mz_arr and int_arr (and that both lengths > 0) and skip the
spectrum (increment skipped) if they differ or are zero-length; reference the
variables mz_arr, int_arr and the called function _bin_spectrum when adding this
guard so only well-formed arrays reach _bin_spectrum.
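The effect of the guard can be demonstrated in isolation; this predicate is a sketch of the proposed check, not the module's code:

```python
import numpy as np

def arrays_well_formed(mz_arr, int_arr):
    """True only when both fragment arrays are non-empty and equal-length,
    the precondition _bin_spectrum needs to index them in lockstep."""
    mz_arr = np.asarray(mz_arr, dtype=np.float64)
    int_arr = np.asarray(int_arr, dtype=np.float64)
    return len(mz_arr) > 0 and len(mz_arr) == len(int_arr)
```

A mismatched pair fails the predicate and the spectrum is counted as skipped instead of crashing the whole conversion.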
| """Convert all .psm.parquet files in a directory to .dat format. | ||
|
|
||
| Args: | ||
| parquet_dir: Directory containing .psm.parquet files | ||
| output_dir: Directory to write .dat files | ||
| file_idx_start: Starting file index | ||
| charge_filter: If set, only include spectra with this charge | ||
|
|
||
| Returns: | ||
| Tuple of (list of .dat file paths, total spectra written) | ||
| """ | ||
| from pyspectrafuse.commands.spectrum2msp import find_target_ext_files | ||
|
|
||
| parquet_files = find_target_ext_files(parquet_dir, '.parquet') | ||
| if not parquet_files: | ||
| raise FileNotFoundError(f"No parquet files found in {parquet_dir}") |
Restrict dataset conversion to the spectrum parquet inputs.
The docstring says this path converts .psm.parquet files, but find_target_ext_files(parquet_dir, '.parquet') will also pick up outputs like cluster_metadata.parquet and psm_cluster_membership.parquet. Those files do not have the spectrum arrays this converter expects, so this silently generates useless .dat artifacts for non-spectrum inputs.
Proposed fix
```diff
-    parquet_files = find_target_ext_files(parquet_dir, '.parquet')
+    parquet_files = find_target_ext_files(parquet_dir, '.psm.parquet')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyspectrafuse/maracluster_dat.py` around lines 264 - 279, The converter is
currently grabbing any ".parquet" files (via find_target_ext_files) and produces
.dat outputs for non-spectrum files; restrict it to spectrum inputs by changing
the finder/filter to only match ".psm.parquet" (e.g., call
find_target_ext_files(parquet_dir, '.psm.parquet') or post-filter parquet_files
to keep only names ending with '.psm.parquet'), and keep the FileNotFoundError
when the filtered list is empty; reference the existing parquet_files variable
and the find_target_ext_files call so the filter change is localized and
prevents producing .dat for cluster metadata or membership files.
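The post-filter alternative mentioned above can be expressed as a one-liner; this helper is illustrative only, not the project's code:

```python
def keep_spectrum_parquets(paths):
    """Keep only *.psm.parquet spectrum inputs, dropping derived outputs
    such as cluster_metadata.parquet and psm_cluster_membership.parquet."""
    return [p for p in paths if str(p).endswith('.psm.parquet')]
```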
````markdown
## Project Structure

Run the test suite:
```
````
Add a language identifier to the fenced block.
Line 162 uses a fenced code block without a language, which triggers markdownlint MD040.
Proposed fix
````diff
-```
+```text
 pyspectrafuse/
 ├── maracluster_dat.py    # Parquet to .dat binary conversion
 ...
 └── mgf_convert/          # MGF conversion utilities
````

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 162-162: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` at line 162, The fenced code block at the README's project tree is
missing a language identifier (triggering markdownlint MD040); update the
opening fence from ``` to ```text (or another appropriate language like ```bash)
so the block becomes labeled (e.g., change the block around the pyspectrafuse
tree to start with ```text) and leave the closing ``` as-is to satisfy the
linter and improve rendering.
Summary
- maracluster_dat.py: parquet-to-dat conversion bypassing MGF entirely (15x smaller output)
- build-cluster-db and parquet2dat CLI commands
- merge-clusters with BIN strategy support
- QPX metadata module (qpx_metadata.py) and MSNet-to-QPX converter

Details
The dat bypass writes MaRaCluster's internal binary format directly from parquet. Binning replicates BinSpectra::binBinaryTruncated: bin = floor(mz / 1.000508 + 0.32), top-40 peaks, skip peaks >= precursor neutral mass. Size: ~100 bytes/spectrum vs ~1.5 KB in MGF.

The build-cluster-db command creates cluster_metadata.parquet + psm_cluster_membership.parquet from MaRaCluster output + scan_titles mapping, designed for the dat-bypass workflow.

Test plan
- python test_e2e_single_dataset.py on PXD014877
- pyspectrafuse convert-dat produces valid .dat files
- pyspectrafuse build-cluster-db end-to-end
- --method_type bin

🤖 Generated with Claude Code
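The binning rule described above can be sketched in a few lines. This is an approximation for illustration, reusing the stated constants; it is not the code in maracluster_dat.py:

```python
import numpy as np

def bin_spectrum(mz_arr, int_arr, prec_mass, top_n=40):
    """Replicate the described rule: drop peaks at or above the precursor
    neutral mass, keep the top-N peaks by intensity, then bin with
    bin = floor(mz / 1.000508 + 0.32)."""
    mz_arr = np.asarray(mz_arr, dtype=np.float64)
    int_arr = np.asarray(int_arr, dtype=np.float64)
    keep = mz_arr < prec_mass                  # skip peaks >= neutral mass
    mz_arr, int_arr = mz_arr[keep], int_arr[keep]
    if len(mz_arr) > top_n:
        top = np.argsort(int_arr)[-top_n:]     # indices of the N most intense peaks
        mz_arr = mz_arr[top]
    bins = np.floor(mz_arr / 1.000508 + 0.32).astype(np.int32)
    return np.unique(bins)                     # sorted, deduplicated bin indices
```

Storing only a handful of small integer bins per spectrum is what makes the .dat payload (~100 bytes/spectrum) so much smaller than MGF text.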
Summary by CodeRabbit
New Features
Refactor
Documentation
Tests